[1] 1599 12
[1] "fixed.acidity" "volatile.acidity" "citric.acid"
[4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
[7] "total.sulfur.dioxide" "density" "pH"
[10] "sulphates" "alcohol" "quality"
'data.frame': 1599 obs. of 12 variables:
$ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
$ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
$ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
$ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
$ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
$ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
$ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
$ density : num 0.998 0.997 0.997 0.998 0.998 ...
$ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
$ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
$ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
$ quality : int 5 5 5 6 5 5 5 7 7 5 ...
fixed.acidity volatile.acidity citric.acid residual.sugar
Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
chlorides free.sulfur.dioxide total.sulfur.dioxide
Min. :0.01200 Min. : 1.00 Min. : 6.00
1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
Median :0.07900 Median :14.00 Median : 38.00
Mean :0.08747 Mean :15.87 Mean : 46.47
3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
Max. :0.61100 Max. :72.00 Max. :289.00
density pH sulphates alcohol
Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
quality
Min. :3.000
1st Qu.:5.000
Median :6.000
Mean :5.636
3rd Qu.:6.000
Max. :8.000
Half of the red wines (the interquartile range) have fixed acidity between 7.10 g/dm^3 and 9.20 g/dm^3.
Half of the red wines (the interquartile range) have volatile acidity between 0.39 g/dm^3 and 0.64 g/dm^3. There are some outliers above 1.5 g/dm^3.
FALSE TRUE
1467 132
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.090 0.260 0.271 0.420 1.000
0 0.49 0.24 0.02 0.26 0.1 0.01 0.08 0.21 0.32 0.03 0.09 0.3 0.31 0.04
132 68 51 50 38 35 33 33 33 32 30 30 30 30 29
0.4 0.42 0.39 0.12 0.22 0.25 0.2 0.23 0.33 0.06 0.34 0.44 0.48 0.07 0.18
29 29 28 27 27 27 25 25 25 24 24 23 23 22 22
0.45 0.14 0.19 0.29 0.05 0.27 0.36 0.5 0.15 0.28 0.37 0.46 0.13 0.47 0.52
22 21 21 21 20 20 20 20 19 19 19 19 18 18 17
0.17 0.41 0.11 0.43 0.38 0.53 0.66 0.35 0.51 0.54 0.55 0.68 0.63 0.16 0.57
16 16 15 15 14 14 14 13 13 13 12 11 10 9 9
0.58 0.6 0.64 0.56 0.59 0.65 0.69 0.74 0.73 0.76 0.61 0.67 0.7 0.62 0.71
9 9 9 8 8 7 4 4 3 3 2 2 2 1 1
0.72 0.75 0.78 0.79 1
1 1 1 1 1
Citric acid has three peaks, around 0, 0.25 and 0.5 g/dm^3. 132 red wines have 0 g/dm^3 citric acid. There is one outlier at 1.0 g/dm^3.
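The share of wines with no citric acid can be read directly off the FALSE/TRUE table above. A minimal sketch, using only the two printed counts (the name `zero_counts` is illustrative and not from the original analysis):

```r
# Counts copied from the FALSE/TRUE table printed above
# (TRUE = citric.acid equals 0).
zero_counts <- c("FALSE" = 1467, "TRUE" = 132)

n_wines    <- sum(zero_counts)                  # total number of wines
share_zero <- zero_counts[["TRUE"]] / n_wines   # fraction with no citric acid

# Percentage of wines with zero citric acid, rounded to one decimal
round(100 * share_zero, 1)
```

With the data frame loaded, the same counts would presumably come from something like `table(wine$citric.acid == 0)`, where `wine` is an assumed name for the data frame.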
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.900 1.900 2.200 2.539 2.600 15.500
The histogram of residual sugar has a single peak and a long right tail. Most red wines have residual sugar between 1.9 g/dm^3 and 2.6 g/dm^3 (median 2.2 g/dm^3, mean 2.539 g/dm^3).
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Most red wines have chlorides between 0.07 g/dm^3 and 0.09 g/dm^3 (median 0.079 g/dm^3, mean 0.08747 g/dm^3). With a log10-transformed x axis, the histogram of chlorides looks approximately normal.
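To illustrate why a log10 transform can make a right-skewed variable look roughly normal, here is a small sketch on simulated data. The values are drawn from a log-normal distribution as a stand-in and are not the real chlorides measurements; `chlorides_sim` and `skewness` are hypothetical names.

```r
set.seed(42)

# Simulated, right-skewed stand-in for chlorides (NOT the real data).
# The median is set near the 0.079 g/dm^3 reported above.
chlorides_sim <- rlnorm(1599, meanlog = log(0.079), sdlog = 0.3)

# Simple moment-based skewness: 0 means symmetric, > 0 means right-skewed.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(chlorides_sim)          # clearly positive
skew_log <- skewness(log10(chlorides_sim))   # close to zero

# Histogram bins on the log10 scale (computed without drawing).
h_log <- hist(log10(chlorides_sim), breaks = 30, plot = FALSE)

c(raw = skew_raw, log10 = skew_log)
```

Calling `plot(h_log)` draws the transformed histogram. In ggplot2, adding `scale_x_log10()` to a histogram has a similar effect on the x axis.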
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 7.00 14.00 15.87 21.00 72.00
Most free.sulfur.dioxide values are integers, and most of them lie between 7 mg/dm^3 and 21 mg/dm^3.
Min. 1st Qu. Median Mean 3rd Qu. Max.
6.00 22.00 38.00 46.47 62.00 289.00
All total.sulfur.dioxide values are integers. Most red wines have total.sulfur.dioxide between 22 mg/dm^3 and 62 mg/dm^3. There are some outliers above 250 mg/dm^3.
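The outlier thresholds quoted in this section appear to have been read off the histograms. A common alternative, which is an assumption here and not necessarily what the original analysis used, is Tukey's rule: flag values above Q3 + 1.5 * IQR. The sketch below applies it to quartiles copied from the summaries above; the data frame name `quartiles` is illustrative.

```r
# Quartiles and maxima copied from the summary() output above.
quartiles <- data.frame(
  variable = c("volatile.acidity", "residual.sugar", "total.sulfur.dioxide"),
  q1       = c(0.39, 1.9, 22),
  q3       = c(0.64, 2.6, 62),
  max      = c(1.58, 15.5, 289)
)

# Tukey's upper fence: Q3 + 1.5 * IQR (an assumed rule, see lead-in).
quartiles$iqr              <- quartiles$q3 - quartiles$q1
quartiles$upper_fence      <- quartiles$q3 + 1.5 * quartiles$iqr
quartiles$max_beyond_fence <- quartiles$max > quartiles$upper_fence

quartiles
```

Under this rule the fences are considerably lower than the visually chosen thresholds, so it would flag more points as outliers than the commentary does.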
Density appears to be approximately normally distributed, with most values between 0.995 and 1.0.
pH also appears to be approximately normally distributed. Most red wines have a pH between 3.21 and 3.4 (median 3.31, mean 3.311).
Sulphates has outliers above 1.5 g/dm^3 and a peak around 0.6 g/dm^3.
Alcohol ranges from 8.4 to 14.9, with its main peak around 10. Most red wines have an alcohol content between 9.5 and 11.1 (median 10.2, mean 10.42).
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.000 5.000 6.000 5.636 6.000 8.000
5 6 7 4 8 3
681 638 199 53 18 10
All quality values are integers between 3 and 8. Most red wines have a quality of 5 or 6 (median 6, mean 5.636).
[1] "3" "4" "5" "6" "7" "8"
3 4 5 6 7 8
10 53 681 638 199 18
Created a “quality_class” factor variable for bivariate and multivariate analysis.
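A minimal sketch of how such a factor could be built. Since the raw data is not reproduced here, the quality column is rebuilt from the counts printed above; `quality_counts` is an illustrative name, and the original code for creating `quality_class` is not shown in this report.

```r
# Rebuild the integer quality column from the counts printed above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

# Convert the integer scores to a factor so plots treat quality as categories.
quality_class <- factor(quality)

levels(quality_class)   # category labels
table(quality_class)    # wines per category
```

The rebuilt column reproduces the reported summary (1599 wines, median 6, mean about 5.636), which is a quick consistency check on the printed counts.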
The data set has 1599 red wines, with 11 input features and 1 output feature (quality).
The main feature of interest in the data set is quality. I’d like to find which chemical properties influence the quality of red wine. I suspect alcohol and some other features can be used to build a predictive model of red wine quality.
Alcohol, density and pH seem to contribute to quality. I think alcohol is the most significant feature because red wine is a kind of liquor.
Yes, I created a “quality_class” variable for bivariate and multivariate analysis of the “quality” feature.
Citric acid has three peaks, around 0, 0.25 and 0.5 g/dm^3. 132 red wines have 0 g/dm^3 citric acid, which is the highest peak.
Chlorides is log-transformed; the transformed distribution looks approximately normal. Most red wines have chlorides between 0.07 g/dm^3 and 0.09 g/dm^3 (median 0.079 g/dm^3, mean 0.08747 g/dm^3).
fixed.acidity volatile.acidity citric.acid
fixed.acidity 1.00000000 -0.256130895 0.67170343
volatile.acidity -0.25613089 1.000000000 -0.55249568
citric.acid 0.67170343 -0.552495685 1.00000000
residual.sugar 0.11477672 0.001917882 0.14357716
chlorides 0.09370519 0.061297772 0.20382291
free.sulfur.dioxide -0.15379419 -0.010503827 -0.06097813
total.sulfur.dioxide -0.11318144 0.076470005 0.03553302
density 0.66804729 0.022026232 0.36494718
pH -0.68297819 0.234937294 -0.54190414
sulphates 0.18300566 -0.260986685 0.31277004
alcohol -0.06166827 -0.202288027 0.10990325
quality 0.12405165 -0.390557780 0.22637251
residual.sugar chlorides free.sulfur.dioxide
fixed.acidity 0.114776724 0.093705186 -0.153794193
volatile.acidity 0.001917882 0.061297772 -0.010503827
citric.acid 0.143577162 0.203822914 -0.060978129
residual.sugar 1.000000000 0.055609535 0.187048995
chlorides 0.055609535 1.000000000 0.005562147
free.sulfur.dioxide 0.187048995 0.005562147 1.000000000
total.sulfur.dioxide 0.203027882 0.047400468 0.667666450
density 0.355283371 0.200632327 -0.021945831
pH -0.085652422 -0.265026131 0.070377499
sulphates 0.005527121 0.371260481 0.051657572
alcohol 0.042075437 -0.221140545 -0.069408354
quality 0.013731637 -0.128906560 -0.050656057
total.sulfur.dioxide density pH
fixed.acidity -0.11318144 0.66804729 -0.68297819
volatile.acidity 0.07647000 0.02202623 0.23493729
citric.acid 0.03553302 0.36494718 -0.54190414
residual.sugar 0.20302788 0.35528337 -0.08565242
chlorides 0.04740047 0.20063233 -0.26502613
free.sulfur.dioxide 0.66766645 -0.02194583 0.07037750
total.sulfur.dioxide 1.00000000 0.07126948 -0.06649456
density 0.07126948 1.00000000 -0.34169933
pH -0.06649456 -0.34169933 1.00000000
sulphates 0.04294684 0.14850641 -0.19664760
alcohol -0.20565394 -0.49617977 0.20563251
quality -0.18510029 -0.17491923 -0.05773139
sulphates alcohol quality
fixed.acidity 0.183005664 -0.06166827 0.12405165
volatile.acidity -0.260986685 -0.20228803 -0.39055778
citric.acid 0.312770044 0.10990325 0.22637251
residual.sugar 0.005527121 0.04207544 0.01373164
chlorides 0.371260481 -0.22114054 -0.12890656
free.sulfur.dioxide 0.051657572 -0.06940835 -0.05065606
total.sulfur.dioxide 0.042946836 -0.20565394 -0.18510029
density 0.148506412 -0.49617977 -0.17491923
pH -0.196647602 0.20563251 -0.05773139
sulphates 1.000000000 0.09359475 0.25139708
alcohol 0.093594750 1.00000000 0.47616632
quality 0.251397079 0.47616632 1.00000000
Alcohol (0.48) and sulphates (0.25) are the features most positively correlated with quality. Volatile.acidity is negatively correlated with quality (-0.39).
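The ranking can be checked by sorting the quality column of the correlation matrix by absolute value. A short sketch, with the coefficients copied from the matrix above (the vector name `quality_cor` is illustrative):

```r
# Correlation of each input feature with quality, copied from the matrix above.
quality_cor <- c(
  fixed.acidity        =  0.12405165,
  volatile.acidity     = -0.39055778,
  citric.acid          =  0.22637251,
  residual.sugar       =  0.01373164,
  chlorides            = -0.12890656,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.18510029,
  density              = -0.17491923,
  pH                   = -0.05773139,
  sulphates            =  0.25139708,
  alcohol              =  0.47616632
)

# Sort by strength of association, ignoring the sign.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]

head(ranked, 3)   # the three strongest correlates of quality
```

Sorting on the absolute value keeps strong negative relationships, such as volatile.acidity, near the top instead of at the bottom of the list.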
As quality increases, the median alcohol tends to increase.
Call:
lm(formula = quality ~ alcohol, data = wineSubset)
Residuals:
Min 1Q Median 3Q Max
-2.8489 -0.4065 -0.1787 0.5176 2.5909
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.81782 0.17512 10.38 <2e-16 ***
alcohol 0.36646 0.01672 21.92 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7083 on 1596 degrees of freedom
Multiple R-squared: 0.2314, Adjusted R-squared: 0.2309
F-statistic: 480.4 on 1 and 1596 DF, p-value: < 2.2e-16
The linear model of alcohol and quality has R^2 value 0.2314.
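To make the fitted line concrete, the sketch below plugs a few alcohol values into the coefficients from the summary above. The function name `predict_quality` is hypothetical, and the alcohol levels are taken from the full-data summary, whereas the model itself was fitted on `wineSubset`.

```r
# Coefficients copied from the lm() summary above.
b0 <- 1.81782   # intercept
b1 <- 0.36646   # change in predicted quality per unit of alcohol

# Predicted quality for a given alcohol value under the one-variable model.
predict_quality <- function(alcohol) b0 + b1 * alcohol

# First quartile, mean and third quartile of alcohol from the summary above.
alcohol_levels <- c(q1 = 9.5, mean = 10.42, q3 = 11.1)
predicted <- predict_quality(alcohol_levels)

round(predicted, 2)

# Predicted change when moving from the first to the third quartile of alcohol.
iqr_effect <- predicted[["q3"]] - predicted[["q1"]]
round(iqr_effect, 2)
```

The shift across the interquartile range of alcohol is smaller than the residual standard error of 0.7083, which is consistent with the modest R^2 of this model.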
Alcohol has a negative correlation with density (-0.50).
Alcohol also has weaker negative correlations with chlorides (-0.22) and total.sulfur.dioxide (-0.21), and a negligible one with free.sulfur.dioxide (-0.07).
As quality increases, the median density tends to decrease.
There seems to be no strong relationship between quality and fixed.acidity.
As quality increases, the median volatile.acidity tends to decrease.
As quality increases, the median citric.acid tends to increase.
Citric acid has three peaks, around 0, 0.25 and 0.5 g/dm^3, at each quality level.
There seems to be no strong relationship between quality and residual.sugar.
As quality increases, the median chlorides tends to decrease.
There seems to be no relationship between quality and free.sulfur.dioxide.
There seems to be no relationship between quality and total.sulfur.dioxide.
As quality increases, the median pH tends to decrease.
As quality increases, the median sulphates tends to increase.
Quality is positively correlated with alcohol and sulphates and negatively correlated with volatile.acidity.
Citric.acid is distributed with three peaks, at 0, 0.25 and 0.5 g/dm^3, at each quality level.
Alcohol has a negative correlation with density. Alcohol also has weaker negative correlations with total.sulfur.dioxide, free.sulfur.dioxide and chlorides.
The quality of red wine is positively correlated with alcohol and sulphates and negatively correlated with volatile.acidity.
As quality increases, alcohol tends to increase.
Alcohol has a negative correlation with density.
There seems to be no specific relation between alcohol and pH.
At every quality level, citric.acid tends to have peaks at 0, 0.25 and 0.5 g/dm^3.
Calls:
m1: lm(formula = quality ~ alcohol, data = wineSubset)
m2: lm(formula = quality ~ alcohol + volatile.acidity, data = wineSubset)
m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates,
data = wineSubset)
===============================================
m1 m2 m3
-----------------------------------------------
(Intercept) 1.818*** 3.038*** 2.547***
(0.175) (0.185) (0.196)
alcohol 0.366*** 0.319*** 0.315***
(0.017) (0.016) (0.016)
volatile.acidity -1.384*** -1.221***
(0.095) (0.097)
sulphates 0.685***
(0.100)
-----------------------------------------------
R-squared 0.231 0.322 0.341
adj. R-squared 0.231 0.321 0.340
sigma 0.708 0.666 0.656
F 480.388 378.330 274.938
p 0.000 0.000 0.000
Log-likelihood -1715.379 -1615.409 -1592.411
Deviance 800.741 706.568 686.521
AIC 3436.757 3238.818 3194.823
BIC 3452.887 3260.324 3221.705
N 1598 1598 1598
===============================================
This linear model (m3) has an R^2 value of 0.341. It uses the three variables with the highest absolute correlation with quality: alcohol, volatile.acidity and sulphates.
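Because adding predictors never lowers R^2, the adjusted R^2 row of the table is the fairer comparison across m1, m2 and m3. The sketch below recomputes it from the reported R^2 values using the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1); the helper name `adj_r2` is hypothetical.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# R-squared values copied from the output above (m1 from the lm() summary,
# m2 and m3 from the comparison table), with the number of predictors in each.
r2 <- c(m1 = 0.2314, m2 = 0.322, m3 = 0.341)
p  <- c(m1 = 1,      m2 = 2,     m3 = 3)

adjusted <- adj_r2(r2, n = 1598, p = p)
round(adjusted, 3)
```

With 1598 observations the penalty for one or two extra predictors is tiny, so the adjusted values stay very close to the raw R^2 and the improvement from m1 to m3 is not an artifact of model size.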
There is a negative correlation between alcohol and density. It can be seen at every quality level in the red wine data set.
Despite its low R-squared value of 0.341, I built a linear model that predicts the quality of red wine.
From the scatter plot of alcohol and pH above, there seems to be no specific relation between alcohol and pH.
Yes, I created a linear model of quality using alcohol, volatile.acidity and sulphates. The variables were selected by the absolute value of their correlation with quality.
The variables in this linear model can account for 34.1% of the variance in the quality of red wines.
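As an illustration of what the three-variable model implies, the sketch below evaluates it for a wine at the median of each predictor and for the same wine with volatile.acidity raised to its third quartile. The coefficients are the rounded m3 values from the table above; `predict_m3` is a hypothetical helper, and the medians come from the full-data summary, not from `wineSubset`.

```r
# Rounded m3 coefficients copied from the model comparison table above.
coef_m3 <- c(intercept = 2.547, alcohol = 0.315,
             volatile.acidity = -1.221, sulphates = 0.685)

# Predicted quality under m3 for given predictor values.
predict_m3 <- function(alcohol, volatile.acidity, sulphates) {
  coef_m3[["intercept"]] +
    coef_m3[["alcohol"]] * alcohol +
    coef_m3[["volatile.acidity"]] * volatile.acidity +
    coef_m3[["sulphates"]] * sulphates
}

# A wine at the median of each predictor.
typical <- predict_m3(alcohol = 10.2, volatile.acidity = 0.52, sulphates = 0.62)

# Same wine, but volatile.acidity at its third quartile (0.64).
high_va <- predict_m3(alcohol = 10.2, volatile.acidity = 0.64, sulphates = 0.62)

round(c(typical = typical, high_va = high_va, difference = high_va - typical), 3)
```

Both predictions sit between quality 5 and 6, where most wines fall, and the negative coefficient on volatile.acidity lowers the prediction as expected.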
Citric acid has three peaks, around 0, 0.25 and 0.5 g/dm^3. 132 red wines have 0 g/dm^3 citric acid, which is the highest peak.
The quality of red wine is positively correlated with alcohol and sulphates and negatively correlated with volatile.acidity.
Alcohol has a negative correlation with density. Regardless of quality, density is negatively correlated with alcohol.
The data set contains 1599 red variants of the Portuguese “Vinho Verde” wine. I started by understanding the individual variables in the data set, and I was interested in the “alcohol” feature because wine is a kind of liquor.
While exploring the data set, I found an interesting distribution in “citric.acid”. It has three peaks, around 0, 0.25 and 0.5 g/dm^3. About 8% of the red wines have 0 “citric.acid”.
As I expected, the feature most correlated with quality is “alcohol”, and other features are also related to quality. “sulphates” is also positively correlated with quality, and “volatile.acidity” is negatively correlated. The linear model with only the “alcohol” variable has an R-squared value of 0.231. Adding “volatile.acidity” and “sulphates” increases the R-squared value to 0.341.
“alcohol” is negatively correlated with “density” regardless of quality: as the percentage of “alcohol” increases, “density” decreases.
Since the data set consists of samples from the one specific red wine mentioned above, this analysis is limited. It would be interesting to obtain data from various regions to reduce any bias from analyzing a single type of wine.